Tracing the Evolution of Physics on the Backbone of Citation Networks 



S. Gualdi, C. H. Yeung, and Y.-C. Zhang

Department of Physics, University of Fribourg, CH-1700 Fribourg, Switzerland
The Nonlinearity and Complexity Research Group,
Aston University, Birmingham B4 7ET, United Kingdom
Web Sciences Center, School of Computer Science and Engineering,
University of Electronic Science and Technology of China, Chengdu 610054, P. R. China

(Dated: August 8, 2011) 

Many innovations are inspired by past ideas in a non-trivial way. Tracing these origins and 
identifying scientific branches is crucial for research inspirations. In this paper, we use citation 
relations to identify the descendant chart, i.e. the family tree of research papers. Unlike other 
spanning trees which focus on cost or distance minimization, we make use of the nature of citations 
and identify the most important parent for each publication, leading to a tree-like backbone of the 
citation network. Measures are introduced to validate the backbone as the descendant chart. We 
show that citation backbones can well characterize the hierarchical and fractal structure of scientific 
development, and lead to accurate classification of fields and sub-fields. 

PACS numbers: 89.75.Hc,89.75.Fb,02.50.-r,05.45.Df 



I. INTRODUCTION 

Many innovations are inspired by past ideas in a non-trivial way.
Examples in statistical physics include the connection between spin
glasses and combinatorial problems [1, 2], the application of critical
phenomena in earthquake modeling [3, 4], and the analysis of disease
spreading by percolation theory [5, 6]. Drawing these connections is
easy, but mapping individual fields onto a descendant chart, i.e. a
family tree of research branches, is more complicated. An even more
difficult task is to uncover the macroscopic tree based on the
microscopic relations between publications. Despite these difficulties,
descendant charts are crucial for revealing the non-trivial connections
between branches that stimulate new ideas. Accurate descendant charts
also give a natural classification of papers.

A solid basis for studying descendant charts is the citation network,
which can be seen as the original map of scientific development. In
recent years, citation and authorship networks have been used to
evaluate the impact of academic papers and scientists [7, 8]. Though
useful information is retrieved, most studies focus on contemporary
impact and ignore the intrinsic hierarchical structure of citations,
which encodes the generation of scientific advances. Unlike the
horizontal exploration in conventional paper classifications [18], we
explore the vertical, i.e. temporal, dimension of citation networks to
identify the descendant charts of publications.

To this end, we identify a backbone of the citation network by removing
all but the most relevant citation for each paper. The backbone thus has
a tree-like structure and is found solely from citation relations, with
no additional information. Similar concepts of spanning trees are
extensively studied in transportation networks and oscillator networks,
such as minimum spanning trees in terms of traveling cost [9, 10] and
trees that maximize betweenness [11] or synchronizability [12]. Though
the citation backbone can be constructed by these definitions, we see no
direct correspondence between them and scientific descendant trees.
Instead, one should make use of the nature of citation relations and
identify the important parent, and thus the offspring, of each paper,
which constitutes a backbone specific to citation networks.

In this paper, we identify the descendant chart for publications in
journals of the American Physical Society (APS), based on their citation
network from 1893 to 2009. Our objectives are three-fold. Firstly, we
introduce a potential approach to identify the most relevant parent
(among the set of original references) for each publication, which leads
to a backbone of the citation network. Secondly, we introduce measures
to validate the citation backbones as representative descendant charts
and compare our approach with two other simple procedures (i.e.
selection of a random parent or of the longest path to the root).
Finally, we show that citation backbones possess features of hierarchy
and self-similarity, and lead to a valid classification of papers in
linear time, compared to conventional polynomial-time algorithms
[16, 17]. The present work pinpoints the importance of scientific
descendant charts, as well as their intrinsic difference from other
spanning trees.



II. METHODS 

To start our analyses, we first denote the references of a paper as its
parents, and the articles citing the paper as its offspring. The sets of
parents and offspring of a paper $i$ are denoted by $\mathcal{P}_i$ and
$\mathcal{O}_i$, with respectively $p_i$ and $o_i$ elements.
Intuitively, the offspring of an important paper should share a similar
focus introduced by their influential parent. Less relevant parents will
by contrast lead to a more heterogeneous descendance. We thus quantify
the impact of parent $a$ on $i$ by
$I_{a \to i} = \sum_{i' \in \mathcal{O}_a \setminus \{i\}} s_{ii'}$,
where $s_{ii'}$ is the similarity between $i$ and $i'$. We refer to
papers in the set $\mathcal{O}_a \setminus \{i\}$ as the peers of $i$
rooted in $a$ (see Fig. 1 for an illustration). The higher the overall
similarity between $i$ and the papers in
$\mathcal{O}_a \setminus \{i\}$, the higher the impact of $a$ on $i$.

FIG. 1: A schematic diagram which shows two peers $i' = i'_1, i'_2$
(shaded) of $i$ rooted from parent $a$. To compute each
$s^{\mathcal{P}}_{ii'}$ we consider a random walk from $i'$ through
papers cited by both $i'$ and $i$. Symmetrically, to compute each
$s^{\mathcal{O}}_{ii'}$ we consider a random walk from $i'$ through
papers citing both $i'$ and $i$.

A simple way to measure the similarity between $i$ and a peer $i'$ is to
count the number of their common references, i.e.
$s_{ii'} = |\mathcal{P}_i \cap \mathcal{P}_{i'}|$. However, this
similarity measure favors peers with many references, resulting in an
impact biased towards parents with a large offspring. This suggests
defining a similarity measure based on a random walk from the peers to
$i$. We thus consider a 2-step random walk from each peer $i'$ to $i$
which passes through their common references, and define a contribution
to the similarity as the probability of this walk,

    $$ s^{\mathcal{P}}_{ii'} = \frac{1}{p_{i'}} \sum_{j \in \mathcal{P}_i \cap \mathcal{P}_{i'}} \frac{1}{o_j}. $$    (1)

The superscript represents the authors' interpretations, as this
similarity is measured through the references chosen by the authors of
$i$. A second contribution is instead given by a random walk through
articles citing $i$ and represents the readers' interpretation of $i$.
In analogy with Eq. (1), it reads

    $$ s^{\mathcal{O}}_{ii'} = \frac{1}{o_{i'}} \sum_{j \in \mathcal{O}_i \cap \mathcal{O}_{i'}} \frac{1}{p_j}. $$    (2)

As defined in both Eqs. (1) and (2), the higher the random walk
probability from $i'$ to $i$, the higher the similarity between $i'$ and
$i$.

Finally, by linearly combining $s^{\mathcal{P}}_{ii'}$ and
$s^{\mathcal{O}}_{ii'}$ and summing over all the peers rooted in $a$, we
obtain the impact of $a$ on $i$ as

    $$ I_{a \to i} = \sum_{i' \in \mathcal{O}_a \setminus \{i\}} \left[ f\, s^{\mathcal{P}}_{ii'} + (1 - f)\, s^{\mathcal{O}}_{ii'} \right], $$    (3)

with $f$ adjusting the relative weights of the two contributions. The
subsequent analysis is simplified by setting $f = 0.5$ unless otherwise
specified. We note that citations between peers [13] do not contribute
to the above measure, as these citations may correspond to relations
other than similarity. For instance, if many peers rooted in parent $a$
cite $i$, it implies that $a$ complements $i$ instead of being merely an
influential parent of $i$. The same is true if $i$ cites many peers
rooted in $a$, which suggests that $a$ is a complement of its peers
rather than a mere influential parent.
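The impact measure can be illustrated with a minimal Python sketch. It assumes that each similarity is the probability of the corresponding 2-step random walk (via common references for the authors' view, via common citing papers for the readers' view) and that $f$ weights the authors' term; the function names and the dictionary-of-sets input format are hypothetical choices made only for this illustration.

```python
from collections import defaultdict

def build_offspring(parents):
    """Invert {paper: set of references} into {paper: set of citing papers}."""
    offspring = defaultdict(set)
    for paper, refs in parents.items():
        for ref in refs:
            offspring[ref].add(paper)
    return offspring

def sim_parents(i, peer, parents, offspring):
    """Walk peer -> common reference j -> i (authors' interpretation)."""
    common = parents[i] & parents[peer]
    return sum(1.0 / len(offspring[j]) for j in common) / len(parents[peer])

def sim_offspring(i, peer, parents, offspring):
    """Walk peer -> common citing paper j -> i (readers' interpretation)."""
    common = offspring[i] & offspring[peer]
    if not common:
        return 0.0
    return sum(1.0 / len(parents[j]) for j in common) / len(offspring[peer])

def impact(a, i, parents, offspring, f=0.5):
    """Impact of parent a on paper i, summed over the peers of i rooted in a."""
    total = 0.0
    for peer in offspring[a] - {i}:
        total += f * sim_parents(i, peer, parents, offspring)
        total += (1.0 - f) * sim_offspring(i, peer, parents, offspring)
    return total
```

In this sketch a parent whose other offspring share many references and citing papers with $i$ receives a larger score, while direct citations between peers play no role.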

By keeping only the reference $a$ with the highest $I_{a \to i}$ for all
$i$, we obtain a citation backbone denoted as the SIM backbone. Cases of
equal scores are extremely rare and do not affect the results (in such
situations we arbitrarily choose the latest reference with the highest
$I_{a \to i}$). In addition, we also examine the RAN and the LON
backbones, which select respectively a random parent and a reference
which gives rise to the longest path to the root (most likely the latest
published parent). Other than serving as a benchmark, the RAN backbone
can be informative, as the random parent is one of the original
references. The LON backbone instead represents a natural choice if
progress is always based on recent developments, as one may follow the
step-by-step evolution of science represented by the maximum number of
steps needed to reach the root.
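The three selection rules can be sketched as follows. This is only an illustration under stated assumptions: `parents` maps each paper to its set of references, `year` gives publication dates with every reference strictly older than the citing paper, and `impact(a, i)` is any scoring callable; all of these names are hypothetical.

```python
import random

def sim_backbone(parents, year, impact):
    """Keep, for each paper, the reference with the highest impact score.

    Ties are broken in favour of the most recent reference.
    """
    return {i: max(refs, key=lambda a: (impact(a, i), year[a]))
            for i, refs in parents.items() if refs}

def ran_backbone(parents, rng=random):
    """Keep one reference per paper, chosen uniformly at random."""
    return {i: rng.choice(sorted(refs)) for i, refs in parents.items() if refs}

def lon_backbone(parents, year):
    """Keep the reference that yields the longest backbone path to a root."""
    depth, chosen = {}, {}
    for i in sorted(parents, key=year.get):  # oldest papers first
        refs = parents[i]
        if not refs:
            depth[i] = 0  # a paper without references is a root
            continue
        best = max(refs, key=lambda a: (depth[a], year[a]))
        chosen[i] = best
        depth[i] = depth[best] + 1
    return chosen
```

Each function returns a `{paper: selected parent}` map, so papers without references are simply absent from the result and act as roots.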



III. STATISTICAL PROPERTIES OF THE 
BACKBONE 

We examine the citation network among the journals of the American
Physical Society from 1893 to 2009. The dataset is composed of
$4.67 \times 10^6$ citations between $4.49 \times 10^5$ publications. In
rare cases there are references to contemporaneous or even
later-published papers. These citations are removed, so that the network
is strictly acyclic.
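The cleaning step can be written in a few lines. The sketch below assumes citations are given as (citing, cited) pairs and that `date` holds a comparable publication date for every paper; both names are hypothetical.

```python
def drop_acausal(citations, date):
    """Keep only citations whose cited paper was published strictly earlier."""
    return [(src, dst) for src, dst in citations if date[dst] < date[src]]
```

Because every remaining link points strictly backwards in time, no cycle can survive the filter.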

We note that all papers without references are potentially roots of the
backbone. As their number is in general greater than one and we are
limited to the simplest case with one selected ancestor per paper,
multiple roots and hence isolated trees may appear in the backbone. In
the subsequent discussion, we will refer to the output of the SIM, RAN
and LON algorithms as the backbone, and to its isolated components as
trees. Since the seemingly isolated roots may be connected by citations
outside the APS network, the number of isolated trees would be lower if
more comprehensive citation data were used. Nevertheless, such isolated
trees may represent a crude classification of papers. Table I summarizes
general statistical properties of the group of trees obtained by the
SIM, RAN and LON approaches.



IV. THE STRUCTURE OF THE BACKBONE 

In this section, we will discuss and derive measures to 
validate the citation backbones as representative of de- 
scendant charts. Three aspects will be studied. Firstly, 
we examine the linkage between different generations of 








                              SIM       RAN       LON
Number of isolated trees      3953      6594      2630
Size:
  largest tree                26115     30358     428147
  2nd largest tree            25386     15697     592
  3rd largest tree            21794     11362     471
⟨Δt⟩ parent-offspring         9.5 y     7.4 y     2.1 y

TABLE I: Statistical properties of the isolated trees in the SIM, RAN
and LON backbones. Values for RAN are averaged over 100 realizations. We
also show the average interval ⟨Δt⟩ (in years) between the date of
publication of a paper and its parent.



papers. Secondly, we quantify the paper classification 
as given by the clustering and branching structure in 
the backbones. Finally, we examine the possible self- 
similarity in citation backbones. 



A. Hierarchy 

We first examine the probability of observing an original citation
between two papers as a function of their distance in the backbone. If
the backbone is meaningful, we expect this quantity to decrease fast as
the distance increases. To compute the distance between $i$ and $j$, we
find their first common ancestor $a'$ in the backbone and count the
numbers of steps $d_{ia'}$ and $d_{ja'}$ required to go from $i$ to $a'$
and from $j$ to $a'$. The distance $d_{ij}$ is then set as
$d_{ij} = d_{ia'} + d_{ja'}$. We consider $d_{ij} = \infty$ for papers
$i$ and $j$ in isolated trees. In Fig. 2 we plot $P(l|d)$ as a function
of $d$ for the SIM, RAN and LON backbones, where $l$ denotes the
presence of a link, i.e. a citation. As we can see, $P(l|1) = 1$ by
definition and all $P(l|d)$ display a power-law decay for small $d$. The
SIM backbone shows a faster decay than the other algorithms, suggesting
that citations are more localized in the neighborhood of a paper in the
SIM backbone. A similar quantity $P(d|l)$ (see the inset of Fig. 2) also
indicates that the SIM backbone is the best representative of the APS
network, since citations are concentrated at $d = 2$ and decay faster as
the distance increases.
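The backbone distance can be computed by climbing from both papers towards the root. The following is a minimal sketch, assuming the backbone is stored as a hypothetical `{paper: selected parent}` dictionary in which roots have no entry and no paper is identified by `None`.

```python
def steps_to_common_ancestor(i, j, parent):
    """Return (d_i, d_j), the steps from i and j up to their first common
    ancestor in the backbone, or None if they sit in different trees."""
    up_i, node, d = {}, i, 0
    while node is not None:          # record every ancestor of i with its depth
        up_i[node] = d
        node, d = parent.get(node), d + 1
    node, d = j, 0
    while node is not None:          # climb from j until that chain is met
        if node in up_i:
            return up_i[node], d
        node, d = parent.get(node), d + 1
    return None

def backbone_distance(i, j, parent):
    """Sum of the two step counts, or infinity for papers in isolated trees."""
    steps = steps_to_common_ancestor(i, j, parent)
    return float("inf") if steps is None else sum(steps)
```

The pair returned by `steps_to_common_ancestor` is also what a two-dimensional histogram over both step counts would be binned on.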

In addition to $P(l|d)$, we further consider $P(l|d_{ia'}, d_{ja'})$,
where $a'$ is again the first common ancestor of $i$ and $j$ in the
backbone. This allows us to see whether citations are localized on the
specific branch of each paper or spread over different ramifications of
the tree. For any pair $(i, j)$ we take $i$ as the later published
paper, such that the only potential citation is $i \to j$. We show in
Fig. 3(a)-(c) the results of $P(l|d_{ia'}, d_{ja'})$ for the three
backbones, as a function of $d_{ia'}$ and $d_{ja'}$. One notes that
increasing $d_{ia'}$ on the line $d_{ja'} = 0$ corresponds to the
vertical trace towards the root, while points with $d_{ja'} \neq 0$
correspond to the various 'ramifications' in the backbone. Both SIM and




FIG. 2: (Color online) Conditional probability $P(l|d)$ of observing a
citation between two papers at distance $d$ on the SIM (red squares),
the RAN (black plus signs) and the LON (green circles) backbone. Results
for RAN are averaged over 100 realizations of the backbone (the variance
is negligible). Inset: conditional probability $P(d|l)$ that two papers
are at distance $d$ given that there is a citation between them.



RAN give a meaningful structure where citations are localized on the
descendant chart of the immediate and next immediate ancestor, i.e. the
triangle in the bottom left-hand corner. Citations between different
ramifications are rare. The LON backbone instead displays a less
coherent structure where citations crossing different lines of research
are common. To examine the difference between SIM and RAN, we also show
the scaled difference of their $P(l|d_{ia'}, d_{ja'})$ on the vertical
axis of Fig. 3(d). This comparison clearly indicates that SIM gives rise
to the most meaningful hierarchy, as citations are mainly found on the
descendant chart of the more relevant ancestors instead of crossing
different charts.



B. Clustering 

In addition to the crude classification given by the isolated trees, the
branches within a single tree are also informative for identifying
research fields and sub-fields. From the clustering point of view, the
method we have introduced is computationally efficient (with complexity
$O(N)$ as long as connectivity is not extensive) compared to algorithms
based on modularity maximization [19, 20] or hierarchical clustering
[21] (with complexity at least $O(N^2)$). Moreover, the clustering
naturally explores the temporal dimension by preserving the
ancestor-descendant relations.

FIG. 3: (Color online) Heat maps which show $P(l|d_{ia'}, d_{ja'})$ as a
function of $d_{ia'}$ and $d_{ja'}$ for (a) the SIM backbone, (b) the
RAN backbone and (c) the LON backbone, with citation $i \to j$. Since
papers only cite references published before them, the observed dark
triangle in LON suggests a rather homogeneous temporal interval between
papers and their best LON ancestor, such that citations with
$d_{ja'} > d_{ia'}$ are highly improbable. Results for RAN are averaged
over 100 realizations of the backbone (the variance is negligible).
(d) Scaled difference of $P(l|d_{ia'}, d_{ja'})$ between SIM and RAN.

In order to map the backbone into clusters, we consider two simple
approaches which involve only a single parameter. The first approach
makes use of the publication year of papers and naturally follows the
order of publication. We first make a cut at the year $Y_c$ such that
papers published before $Y_c$ are removed. We then consider each
unconnected component as a different branch, i.e. a different cluster,
in the original backbone, and as a



classification for papers. 
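The cut-year approach can be sketched in Python as below, assuming the backbone is a hypothetical `{paper: selected parent}` dictionary and `year` maps every paper to its publication year. Each surviving paper is labeled by the oldest surviving ancestor on its backbone path, which identifies its connected component after the cut.

```python
def cut_year_clusters(parent, year, y_c):
    """Label every paper published in year y_c or later by the root of its
    connected component once older papers are removed from the backbone."""
    label = {}
    for paper in year:
        if year[paper] < y_c:
            continue                      # removed by the cut
        path, node = [], paper
        while node not in label:
            path.append(node)
            up = parent.get(node)
            if up is None or year[up] < y_c:
                label[node] = node        # node heads its component
            else:
                node = up
        for visited in path:
            label[visited] = label[node]
    return label
```

Labels are cached along each climbed path, so every paper is visited a bounded number of times and the cost grows linearly with the number of papers.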

The second approach is dependent on the cluster size 
which we consider to be a typical research branch. Start- 
ing from the leaves of the backbone (i.e. papers with 
no offspring) we trace towards the root until a branch- 
ing point is reached. The branching point is defined as a 
node of the network from which at least (i) two ramifica- 
tions start and (ii) two ramifications are extended more 
than S steps. When a branching point satisfies these re- 
quirements, all ramifications originating from it are con- 
sidered as different branches, resulting in a classification 
of papers. Here we quantify the validity of the clustering as a function
of the parameters $Y_c$ and $S$.

In order to evaluate the quality of a given clustering, we use two
different measures. The first one, which we call exclusivity, is a
modified modularity measure specific to directed acyclic graphs. The
rationale behind this measure is to compute the fraction of links of the
original network falling inside the same cluster and compare it with the
expected value for a random directed acyclic graph. We denote the set of
papers assigned to branch $x$ as $X$ and define the exclusivity as

    $$ E = \frac{1}{N} \sum_{x} \sum_{i \in X} \left( \frac{p_i^x}{p_i} - \frac{n_i^x}{n_i} \right), $$    (4)

where $N$ is the number of classified papers; $p_i$ is again the number
of references of $i$; $p_i^x$ is the number of $i$'s references in
branch $x$; $n_i$ is the number of papers published before $i$; $n_i^x$
is the number of papers published before $i$ in branch $x$. The term
$n_i^x/n_i$ thus corresponds to the expected fraction of links from $i$
to an element in $X$ in the random case. To reduce the noise from small
clusters, we have excluded branches with fewer than 10 papers.
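A possible implementation of the exclusivity is sketched below. It assumes that the measure is the average, over classified papers, of the fraction of a paper's references falling in its own branch minus the fraction of earlier papers belonging to that branch; the normalization and all names (`parents`, `year`, `label`, `min_size`) are assumptions made for this illustration only.

```python
from bisect import bisect_left
from collections import defaultdict

def exclusivity(parents, year, label, min_size=10):
    """Mean over classified papers of p_i^x / p_i - n_i^x / n_i (assumed form)."""
    all_years = sorted(year.values())
    branch_years = defaultdict(list)
    for paper, x in label.items():
        branch_years[x].append(year[paper])
    for years in branch_years.values():
        years.sort()
    total, count = 0.0, 0
    for i, x in label.items():
        refs = parents.get(i, ())
        n_i = bisect_left(all_years, year[i])        # papers published before i
        if len(branch_years[x]) < min_size or not refs or n_i == 0:
            continue                                  # small branch or root paper
        p_x = sum(1 for a in refs if label.get(a) == x)
        n_x = bisect_left(branch_years[x], year[i])  # of which in branch x
        total += p_x / len(refs) - n_x / n_i
        count += 1
    return total / count if count else 0.0
```

Papers without references are skipped because their own-branch fraction is undefined, which is a choice of this sketch rather than something stated in the text.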

The second measure we use is the effective number of PACS, $N_P$, which
counts the average number of heterogeneous PACS numbers in individual
branches. Good paper classifications result in small values of $N_P$. We
first denote by $r_p$ the fraction of papers in branch $x$ which are
labeled by the PACS number $p$, and note that $\sum_p r_p > 1$, as
papers are always labeled by more than one PACS number. $N_P$ is then
defined as

    $$ N_P = \frac{1}{\sum_p \tilde{r}_p^2}, $$    (5)

where $\tilde{r}_p = r_p / \sum_{p'} r_{p'}$. Therefore, $N_P = 1$ when
there is only one PACS number in the branch, which corresponds to the
optimal classification of papers. On the other hand, $N_P$ attains its
maximum when all PACS numbers in $X$ have an equal share (i.e. equal
$\tilde{r}_p$), and a large $N_P$ thus corresponds to high heterogeneity
inside single clusters. We remark that in evaluating $N_P$, only the
first four digits are used to distinguish PACS numbers.
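The effective number of PACS is an inverse participation ratio of the normalized PACS shares, which can be sketched as follows. The input format (one collection of PACS codes per paper, written like "89.75.Hc") and the function names are assumptions of this illustration, as is the plain average over branches.

```python
from collections import Counter

def effective_pacs(branch):
    """Effective number of PACS classes in one branch.

    branch -- iterable with one collection of PACS codes per paper; the
              branch is assumed to carry at least one code.
    """
    counts = Counter()
    for codes in branch:
        counts.update({code[:5] for code in codes})  # first four digits only
    total = sum(counts.values())
    return 1.0 / sum((c / total) ** 2 for c in counts.values())

def mean_effective_pacs(branches):
    """Average of the effective number of PACS over all branches."""
    values = [effective_pacs(b) for b in branches]
    return sum(values) / len(values)
```

A branch whose papers all share one four-digit PACS class gives exactly 1, while a branch with k equally represented classes gives k.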

In Fig. 4 we plot $E$ and $N_P$ as a function of the two parameters
$Y_c$ and $S$. Both measures are biased by the cluster size, but in
opposite ways. While $N_P$ indicates better clustering (and thus a lower
value) when isolated clusters are of smaller size, $E$ indicates better
clustering (and thus a higher value) when clusters are of larger size.
Even with the compensation by $n_i^x/n_i$ in Eq. (4), we still observe a
small bias of $E$ on cluster size. These biases may influence our
comparison of the clusters identified from the SIM, RAN and LON
backbones, as they have different sizes. Nevertheless, the combination
of the two independent measures clearly indicates that SIM is the best
choice to obtain a meaningful clustering, despite the bias introduced by
cluster sizes. Moreover, the exclusivity of the SIM backbone is higher
for any value of the parameter $S$, which further supports the validity
of the comparison despite the presence of the bias.



C. Self-similarity 

Other than the hierarchical and clustering properties, 
the backbones may possess self-similarity. Intuitively, 
self-similarity may be induced when branches of research 
successively generate branches of significant advances. 
The existence of fractality in the backbone would provide
support for its relevance to the evolution of science.

FIG. 4: (Color online) The exclusivity $E$ and the effective number of
PACS $N_P$ as a function of cut-year $Y_c$ and branch depth $S$ for the
SIM (red squares), the RAN (black plus signs) and the LON (green
circles) backbone. Both quantities show that SIM gives a more meaningful
division into branches. Results for RAN are averaged over 100
realizations of the backbone (the variance is negligible).

FIG. 5: (Color online) $N(l_B)$ as a function of $l_B$ for the largest
tree in the original network and its SIM, LON and RAN backbones, taken
as the minimum over 20 random sequences of seed nodes for the
box-covering algorithm. The SIM backbone is obtained at $f = 0.5$.
Inset: $N(l_B)$ as a function of $l_B$ for the three largest trees in
the SIM backbone.

To show the self-similarity in networks, one can measure their fractal
dimension by the box-covering method [11, 14, 15]. In this approach, the
fractal dimension $d$ is defined as the power-law exponent in

    $$ N(l_B) \sim l_B^{-d}, $$    (6)

where $N(l_B)$ is the minimum number of boxes, each of radius $l_B$,
required to cover the whole network. Obtaining the exact $N(l_B)$ is in
general difficult; we thus employ the random sequential box-covering
algorithm [15], which gives an approximate $N(l_B)$ with the same
scaling. Specifically, we start with all nodes being "uncovered" and
repeat the following procedure until all nodes become "covered": (1)
pick a seed node at random, (2) find all "uncovered" nodes within a
distance of $l_B$ from the seed, and (3) increase $N(l_B)$ by one if
there exists at least one such "uncovered" node and mark all of them as
"covered". Note that a "covered" node can also be a seed in subsequent
searches. For the same tree, we take the minimum of $N(l_B)$ among 20
random sequences as our final value for each value of $l_B$.
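The random sequential box-covering procedure described above can be sketched as follows. The adjacency-list input `adj` (an undirected view of the tree), the function names and the fixed random seed are hypothetical choices for this illustration, and no attempt is made at efficiency.

```python
import random
from collections import deque

def box_count(adj, l_b, rng):
    """One run of random sequential box covering; returns the boxes used."""
    nodes = list(adj)
    covered, boxes = set(), 0
    while len(covered) < len(nodes):
        seed = rng.choice(nodes)      # covered nodes may serve as seeds too
        dist = {seed: 0}
        queue = deque([seed])
        while queue:                  # breadth-first search up to radius l_b
            u = queue.popleft()
            if dist[u] == l_b:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        new = set(dist) - covered
        if new:                       # count the box only if it covers something
            boxes += 1
            covered |= new
    return boxes

def min_box_count(adj, l_b, runs=20, seed=0):
    """Minimum number of boxes over several random seed sequences."""
    rng = random.Random(seed)
    return min(box_count(adj, l_b, rng) for _ in range(runs))
```

Repeating `min_box_count` over a range of radii and fitting the slope on a log-log scale would then give an estimate of the exponent in Eq. (6).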

We show in Fig. 5 the results of $N(l_B)$ as a function of $l_B$ for the
largest tree in the SIM, RAN and LON backbones. The results are compared
to $N(l_B)$ of the original citation network. As we can see, $N(l_B)$
from the LON backbone has the highest resemblance to a power law, while
that of the RAN backbone shows the fastest decay. The LON tree has a
long tail of $N(l_B)$, as it is the longest and the largest in size (see
Table I). Only the largest tree of a particular realization of the RAN
backbone is shown, as similar results are observed in other
realizations. Though a long tail is not observed in the SIM tree, it
shows a power-law-like behavior up to an intermediate value of $l_B$.
Similar behaviors are also observed in the other isolated trees of the
SIM backbone, as shown in the inset of Fig. 5.

We interpret the results as follows. The observed resemblance to power
laws in the SIM and LON backbones may suggest the presence of
self-similarity in their descendant charts. While the LON backbone does
not possess a meaningful hierarchy or clustering compared to the SIM
backbone, its step-by-step structure indeed shows the highest
fractality. We note that a rather short power law is also observed in
the original network, though characterized by a different exponent from
the SIM and LON backbones. On the other hand, such fractality is not
observed in the RAN backbone.



V. POTENTIAL APPLICATIONS 

In this section, we briefly describe the implications and 
potential applications of the citation backbone as a de- 
scendant chart of research papers. 

As the backbone is a sketch of the skeleton of scientific development,
it can be applied to identify seminal papers. Preliminary results show
that a simple measure based on the number of relevant offspring, i.e.
followers in the backbone, is sufficient to give a meaningful ranking
that is not trivially correlated with the original number of incoming
citations (between the two rankings, Kendall's correlation coefficient
is 0.19 and there is an overlap of only 7 papers in the top 20 ranks).
This serves as a simple yet meaningful definition of the impact of a
publication. More refined definitions which take into account the
reputation of each relevant offspring and/or the structural role of a
given paper in the backbone can give an even better selection of
fundamental papers. Moreover, our formulation of a tunable weight on the
authors' and readers' interpretations in Eq. (3) can be easily
incorporated in common ranking algorithms such as PageRank, where an
even repartition of citation importance is instead assumed.
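Such a ranking can be sketched in a few lines. The example assumes that "followers" means the papers that kept a given paper as their selected parent in the backbone, which is one possible reading of the text; the function names and the `{paper: selected parent}` input are hypothetical.

```python
from collections import Counter

def follower_counts(parent):
    """Number of backbone offspring of each paper."""
    return Counter(parent.values())

def top_k_overlap(score_a, score_b, k):
    """Number of papers shared by the top-k sets of two score dictionaries."""
    top_a = set(sorted(score_a, key=score_a.get, reverse=True)[:k])
    top_b = set(sorted(score_b, key=score_b.get, reverse=True)[:k])
    return len(top_a & top_b)
```

Comparing `follower_counts(parent)` with the raw citation counts through `top_k_overlap` gives the kind of top-rank overlap quoted above.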

The second application corresponds to the classification of papers. As
we have mentioned before, such clustering divides papers into research
fields or sub-fields and offers a basis for a synthetic picture of the
state of the art. There are several advantages over conventional
classifications, which include (1) lower computational complexity, (2)
additional information on sub-clustering as given by the internal tree
structure, and (3) predictions of future development by considering the
rate of growth of sub-branches. This last feature is especially useful
to filter the most active directions in the large amount of literature
at our disposal.

VI. CONCLUSIONS 

We have shown that a simple backbone constructed from the most relevant
citations can well characterize the original citation network.
Conversely, non-trivial information stored in the citation network can
be simply extracted from its backbone. While conventional spanning trees
are based on contemporary information, we demonstrated the significance
of the temporal dimension in citation backbones.

Specifically, we have introduced both a simple ap- 
proach to identify the most relevant reference for each 
publication and effective measures to quantify the va- 
lidity of the resulting backbone. Our results show that 
the essential features of hierarchy and paper clustering 
in the original network are well captured by our cita- 
tion backbone, while this is not the case for other simple 
approaches. On the other hand, we showed that resem- 
blance to self-similarity is observed in citation backbones. 

In terms of applications, the backbone can be considered as a descendant
chart of research papers, which constitutes a useful basis for
identifying seminal papers and paper clusters, and in general a
synthetic picture of different research fields. In particular, paper
classification by means of the backbone is computationally efficient
when compared to conventional clustering approaches, and provides
additional information on the cluster structure besides a mere cluster
label.

While we only investigated the citation network of the American Physical
Society, the same approach can be readily applied to other citation
networks. It would also be interesting to examine the potential of the
present approach on other directed acyclic graphs.